Machine learning-based prediction of in-ICU mortality in pneumonia patients

Conventional severity-of-illness scoring systems have shown suboptimal performance for predicting in-intensive care unit (ICU) mortality in patients with severe pneumonia. This study aimed to develop and validate machine learning (ML) models for mortality prediction in patients with severe pneumonia. This retrospective study evaluated patients admitted to the ICU for severe pneumonia between January 2016 and December 2021. The predictive performance was analyzed by comparing the area under the receiver operating characteristic curve (AU-ROC) of ML models to that of conventional severity-of-illness scoring systems. Three ML models were evaluated: (1) logistic regression with L2 regularization, (2) gradient-boosted decision tree (LightGBM), and (3) multilayer perceptron (MLP). Among the 816 pneumonia patients included, 223 (27.3%) died. All ML models significantly outperformed the Simplified Acute Physiology Score II (AU-ROC: 0.650 [0.584–0.716] vs 0.820 [0.771–0.869] for logistic regression vs 0.827 [0.777–0.876] for LightGBM vs 0.838 [0.791–0.884] for MLP; P < 0.001). In the net reclassification improvement (NRI) analysis, the LightGBM and MLP models showed superior reclassification compared with the logistic regression model in predicting in-ICU mortality in all length of stay in the ICU subgroups; all age subgroups; all subgroups with any APACHE II score, PaO2/FiO2 ratio < 200; all subgroups with or without history of respiratory disease; with or without history of CVA or dementia; treatment with mechanical ventilation, and use of inotropic agents. In conclusion, the ML models have excellent performance in predicting in-ICU mortality in patients with severe pneumonia. Moreover, this study highlights the potential advantages of selecting individual ML models for predicting in-ICU mortality in different subgroups.

Data collection, study design, and population. This observational study retrospectively evaluated patients admitted to the ICU for pneumonia at the SNU-SMG Boramae Medical Center between January 2016 and December 2021. The inclusion criteria were: (1) age ≥ 18 years; (2) ICU admission; (3) International Classification of Disease 10th edition code for pneumonia as a major diagnosis or detection of pneumonia on chest computed tomography within 1 week of ICU admission; (4) C-reactive protein (CRP) level ≥ 4 mg/dL; and (5) use of antibiotics for pneumonia. The exclusion criteria were: (1) no oxygen requirement; (2) transfer to the general ward within 3 days; and (3) ICU admission due to more serious medical conditions other than pneumonia.
Baseline data (age, sex, body mass index, smoking history, previous underlying disease, and respiratory comorbidities) and clinical features were collected. Clinical features included prognostic scores, vital signs, laboratory examination results, and treatment with antibiotics or steroids.
Main outcome measures. The primary outcome measure was the prognostic accuracy of ML models compared with that of conventional severity-of-illness scoring systems for predicting in-ICU mortality in patients who required ICU admission for severe pneumonia. The secondary outcome measures were as follows: (1) the prognostic accuracy of the ML models; (2) the prognostic accuracy of the ML models in different subgroups; and (3) the clinical factors contributing to the prediction of in-ICU mortality in patients admitted to the ICU for severe pneumonia. Severe pneumonia requiring ICU admission was defined as the presence of at least one major criterion or three minor criteria of the Infectious Disease Society of America/American Thoracic Society guidelines 3.
The conventional severity-of-illness scoring systems included the SOFA, SAPS II, and APACHE II scores. Among these scoring systems, the one with the strongest performance was used as the baseline comparator. For the statistical and ML models, we tested three popular models: (1) logistic regression with L2 regularization, (2) gradient-boosted decision tree (LightGBM), and (3) multilayer perceptron (MLP).
Data splitting and preprocessing. Variables with more than a 20% missing rate were excluded to generate the available dataset. Approximately 40% of the data were randomly separated with stratification by the outcome and subgrouping variables. The held-out data were used as a test set only for internal validation of the models. The remaining data were used to develop the models as a training set in a tenfold cross-validation scheme.
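As a rough illustration of the splitting scheme described above, the following standard-library sketch holds out about 40% of the cases with stratification and then builds tenfold cross-validation splits on the remainder. It is a simplification under stated assumptions: it stratifies on the outcome only (the study also stratified by the subgrouping variables), and the function names, the seed, and the toy label vector are hypothetical.

```python
import random
from collections import defaultdict

def stratified_holdout(labels, test_fraction=0.4, seed=0):
    """Return (train_idx, test_idx), keeping the outcome ratio similar in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

def k_folds(indices, k=10, seed=0):
    """Yield (fit_idx, val_idx) pairs for k-fold cross-validation."""
    rng = random.Random(seed)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        val = folds[i]
        fit = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield fit, val

# Toy label vector with the reported outcome counts: 223 deaths among 816 patients.
labels = [1] * 223 + [0] * 593
train_idx, test_idx = stratified_holdout(labels)
folds = list(k_folds(train_idx))
```

In practice a library routine that supports stratification on several variables at once would be preferable; the sketch only shows the order of operations (hold-out first, cross-validation inside the training part).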
Missing values were imputed using multivariate imputation by chained equations 44. Outliers were detected using an isolation forest 45 and subsequently replaced with the closest normal value of the training set. All the variables included in the analysis and their missing rates are listed in Supplementary Table S1.
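One possible reading of the outlier-replacement step is sketched below for a single variable. This is an assumption rather than the study's implementation: the outlier flags are taken as given (in the study they came from an isolation forest, which is not reproduced here), "closest normal value" is interpreted as the nearest non-outlying value observed in the training set, and the function name and the lactate-like numbers are hypothetical.

```python
def replace_outliers(values, is_outlier, train_inliers):
    """Replace each flagged value with the nearest inlier value from the training set."""
    return [
        min(train_inliers, key=lambda t: abs(t - v)) if flagged else v
        for v, flagged in zip(values, is_outlier)
    ]

# Hypothetical values for one variable; flags are assumed to come from an outlier detector.
train_inliers = [0.8, 1.1, 1.4, 2.0, 2.6]
values = [1.2, 15.0, 0.1, 2.1]
is_outlier = [False, True, True, False]

cleaned = replace_outliers(values, is_outlier, train_inliers)
```

Taking the replacement values from the training set only, as the text describes, keeps information from the test set out of preprocessing.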
Variable importance and feature selection. The influence of each variable on the predictive ability of the model was evaluated using the SHapley Additive exPlanations (SHAP) method 46. To rank the variables, the mean absolute SHAP values were calculated as the relative importance of the variables. A LightGBM was used for the SHAP evaluation 47. The guiding metric for cross-validation performance was the area under the receiver operating characteristic curve (AU-ROC).
www.nature.com/scientificreports/
Model development. Supplementary Fig. S1 presents the workflow diagram for model development.
LightGBM is a gradient-boosted tree-based ensemble model, whereas MLP is a feedforward neural network with a basic architecture comprising fully connected layers. The hyperparameters were tuned using Bayesian optimization to maximize the cross-validation performance. Details of the hyperparameter tuning with package information for all tested models are described in Supplementary Methods and Supplementary Table S2. The models were calibrated using isotonic regression according to the validation data obtained during cross-validation. The methods in the present study were implemented in Python version 3.9.7 (Python Software Foundation, Wilmington, Del, USA), with scikit-learn (version 1.1.2).
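To make the calibration step more concrete, here is a minimal pure-Python sketch of isotonic regression using the pool-adjacent-violators algorithm. It is illustrative only: the study used a library implementation, whereas this version returns a step function (scikit-learn's `IsotonicRegression`, for example, interpolates between fitted points), assumes distinct raw scores, and uses hypothetical function names and toy data.

```python
from bisect import bisect_left

def fit_isotonic(scores, outcomes):
    """Pool-adjacent-violators fit of a non-decreasing step function.

    Returns (uppers, values): the largest raw score in each block and the
    calibrated probability (mean outcome) of that block.
    """
    blocks = []  # each block: [sum of outcomes, count, largest score]
    for s, y in sorted(zip(scores, outcomes)):
        blocks.append([float(y), 1, s])
        # Merge backwards while the previous block's mean is not below the last one.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] >= blocks[-1][0] / blocks[-1][1]
        ):
            total, count, top = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = top
    uppers = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return uppers, values

def calibrate(score, uppers, values):
    """Map a raw score to the calibrated probability of the block it falls in."""
    i = bisect_left(uppers, score)
    return values[min(i, len(values) - 1)]

# Toy validation-fold predictions (raw scores) and observed outcomes.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
outcomes = [0, 0, 1, 0, 0, 1, 1, 1]
uppers, values = fit_isotonic(scores, outcomes)
```

With these toy data the fit yields three blocks, so a raw score of 0.45 maps to a calibrated probability of 1/3. The mapping is monotone, which means calibration changes the probabilities but not the ranking of patients, apart from creating ties.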
Internal validation in different subgroups. The model performance in the different patient subgroups was evaluated using the test set. We prespecified subgroups based on the clinically important phenotypes of pneumonia in the ICU 48. Subgroup analyses were performed according to (1) the period from hospital admission to ICU admission, (2) age, (3) APACHE II scores, (4) PaO2/FiO2 ratio, (5) history of chronic respiratory disease, (6) history of cerebrovascular accident (CVA) or dementia, (7) treatment with mechanical ventilation (MV), and (8) use of vasopressors.
Statistical analysis. Categorical variables were analyzed using the chi-squared test, while continuous variables were analyzed using the independent t-test or Mann-Whitney U test. We evaluated the AU-ROC as an overall performance measure and compared the models using the DeLong method 49. Owing to an imbalance in outcome prevalence, the area under the precision-recall curve (AU-PRC) was also evaluated as another overall performance measure 50. Furthermore, the performance of the models was evaluated in detail according to sensitivity, positive predictive value, negative predictive value, diagnostic odds ratio, and net reclassification improvement (NRI) 51 at three low false-positive rate (FPR) levels of 10%, 20%, and 30%. The recalibration effects were also evaluated using decision curves, which present the net benefit across different decision thresholds 52. The sensitivity at fixed FPR levels was evaluated in the subgroups using the best-performing model for overall performance. Calibration errors were evaluated before and after calibration using the Brier score and calibration curves. The Brier score was calculated as the mean squared error of the predicted probabilities 53. P values were adjusted using the Benjamini-Hochberg method for multiple comparisons, and the significance level was set at P < 0.05.
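Three of the quantities named above are simple enough to sketch directly: the Brier score, the sensitivity at a fixed FPR level, and Benjamini-Hochberg adjusted P values. The code below is a minimal standard-library illustration under stated assumptions, not the study's implementation; the function names and toy numbers are hypothetical, and a case is treated as predicted positive when its score is at or above the threshold.

```python
def brier_score(y_true, y_prob):
    """Mean squared error between predicted probabilities and observed outcomes."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

def sensitivity_at_fpr(y_true, y_score, max_fpr):
    """Highest sensitivity over thresholds whose false-positive rate is <= max_fpr."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    best = 0.0
    for t in sorted(set(y_score)):
        fpr = sum(s >= t for s in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(s >= t for s in pos) / len(pos))
    return best

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity of the adjustment.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Toy data: 4 non-survivors followed by 10 survivors, with predicted risks.
y_true = [1] * 4 + [0] * 10
y_score = [0.9, 0.8, 0.55, 0.3,
           0.85, 0.6, 0.5, 0.45, 0.4, 0.35, 0.25, 0.2, 0.15, 0.1]

sens_10 = sensitivity_at_fpr(y_true, y_score, 0.10)
sens_20 = sensitivity_at_fpr(y_true, y_score, 0.20)
sens_30 = sensitivity_at_fpr(y_true, y_score, 0.30)
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.20])
```

When two models are compared at the same FPR level, their specificities are equal by construction, so a two-category NRI (change in sensitivity plus change in specificity) reduces to the difference in sensitivity at that level.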

Results
Baseline characteristics and clinical features. In total, 816 patients with pneumonia admitted to the ICU were included in the analysis (Fig. 1). Baseline characteristics were compared between the survivor and non-survivor groups. Patients in the non-survivor group were more likely to be older and to be current smokers, and had more smoking pack-years among ever-smokers (9.9). Regarding comorbidities, the non-survivor group also included more patients with interstitial lung disease, pulmonary tuberculosis, lung cancer, chronic kidney disease, cerebellar vessel disease, cardiovascular disease, chronic heart failure, and metastatic cancer. Further, this group had higher illness severity scores, including the APACHE II, SOFA, and SAPS II scores.
The clinical characteristics are presented in Table 2. Regarding vital signs, the survivor group had higher systolic, diastolic, and mean blood pressure, while the non-survivor group had faster heart and respiratory rates and lower urine output. The non-survivor group had lower levels of partial pressure of oxygen (PaO2) or carbon dioxide (HCO3−), lower levels of oxygen saturation (SpO2), and a lower ratio of PaO2/FiO2. For laboratory findings, the levels of urea nitrogen, creatinine, alanine aminotransferase, total bilirubin, and alkaline phosphatase (ALP) and the prothrombin time and international normalized ratio (PT-INR) were higher in the non-survivor group. Meanwhile, the survivor group was more likely to be treated with steroids and vasopressors. The comparison results between the training and test sets for the baseline characteristics and clinical features are presented in Supplementary Tables S3 and S4.

Overall performance of ML models according to subgroups. Figure 3 shows the overall performance of the ML models for in-ICU mortality in the different subgroups. The models performed consistently in most of the subgroups. Despite a difference in the in-ICU mortality rate, there was no significant difference in

In the analysis for NRI, the LightGBM and MLP models showed superior reclassification compared with the logistic regression model in predicting in-ICU mortality in all length of stay (LOS) in the ICU subgroups; all age subgroups; all subgroups with any APACHE II score, PaO2/FiO2 ratio < 200; all subgroups with or without history of respiratory disease; with or without history of CVA or dementia; treatment with MV, and use of inotropic agents (Supplementary Table S8). In most of the above subgroups, the LightGBM model showed higher NRI values than the MLP, except for the subgroups aged 65-74 years, with an APACHE II score of ≤ 19, and with history of CVA or dementia.

Attributable variables with importance plots and SHAP values. The selected predictors of in-ICU mortality are shown in Fig. 4. From a total of 55 variables, 16 were selected. The selected variables included the PaO2/FiO2 ratio, CRP level, lactate level, urine output, initial systolic blood pressure (SBP), and white blood cell (WBC) count. Among these variables, the partial SHAP dependence plots for the top six variables ranked by mean absolute SHAP value are illustrated in Supplementary Fig. S4; those for the other variables are shown in Supplementary Fig. S5. The local interpretability of the LightGBM model is demonstrated in Supplementary Fig. S6, which shows how the model predicts each case of true positive, true negative, false positive, and false negative.
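As a small illustration of how a ranking by mean absolute SHAP value is obtained, the sketch below starts from an already computed matrix of per-patient SHAP values (one row per patient, one column per variable). Computing the SHAP values themselves requires a dedicated explainer and is not shown; the function name, the three variables, and all numbers are hypothetical and are not the study's results.

```python
def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by the mean absolute SHAP value across patients (largest first)."""
    n = len(shap_values)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_values) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical SHAP values for three patients and three variables.
feature_names = ["PaO2/FiO2 ratio", "CRP", "lactate"]
shap_values = [
    [-0.40, 0.10, 0.20],
    [0.60, -0.10, -0.20],
    [-0.50, 0.10, 0.05],
]

ranking = rank_by_mean_abs_shap(shap_values, feature_names)
top_features = [name for name, _ in ranking]
```

Taking the absolute value before averaging matters: a variable that pushes risk up in some patients and down in others would otherwise average out toward zero despite being influential.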

Discussion
Our study evaluated the prognostic accuracy of ML models compared with that of conventional severity-of-illness scoring systems for predicting in-ICU mortality in patients with severe pneumonia. All ML models showed excellent performance in predicting in-ICU mortality and were superior to SAPS II. In addition, when the ML models were applied in the different subgroups, the LightGBM and MLP models showed superior reclassification compared with the logistic regression model in predicting in-ICU mortality in all LOS in the ICU subgroups; all age subgroups; all subgroups with any APACHE II score, PaO2/FiO2 ratio < 200; all subgroups with or without history of respiratory disease; with or without history of CVA or dementia; treatment with MV, and use of inotropic agents. Furthermore, the LightGBM model showed higher NRI values than the MLP in most of the above subgroups, except for the subgroups aged 65-74 years, with an APACHE II score of ≤ 19, and with history of CVA or dementia. Therefore, ML models have the potential to improve in-ICU mortality prediction in patients with severe pneumonia admitted to the ICU. Moreover, this study shows the potential advantages of selecting individual ML models for predicting in-ICU mortality in different subgroups of these patients.
CURB-65 and PSI are the most commonly used clinical severity-of-illness scoring systems for patients with community-acquired pneumonia. However, CURB-65 has low sensitivity, while PSI has low specificity for mortality 54. The Clinical Pulmonary Infection Score is a well-validated prediction model for the development of ventilator-associated pneumonia. However, its predictive performance for mortality is inferior to that of the APACHE II score 55. Given the limitations of clinical severity-of-illness scoring systems, the usefulness of ML models has recently been studied to predict mortality in patients with pneumonia. The results show that various ML models outperform CURB-65 38,39,56 and PSI 57 for predicting mortality in patients with severe community- or hospital-acquired pneumonia. However, few previous studies have used ML models to predict the prognosis of patients with severe pneumonia admitted to the ICU. One small study reported that an ML approach had better performance than APACHE II and PSI for predicting mortality in critically ill influenza patients 58. In our study, SAPS II showed a numerically higher AU-ROC than APACHE II and SOFA for predicting in-ICU mortality on the first day of ICU admission. Importantly, the ML models outperformed SAPS II, suggesting that ML models can provide more accurate information for optimal decision-making based on the estimated probability of mortality.
We selected logistic regression, LightGBM, and MLP as the ML predictive models for mortality in patients with severe pneumonia based on several considerations. Logistic regression is a well-established and widely used ML model for binary classification tasks. It offers a simple and interpretable approach to modeling the relationship between predictor variables and the outcome as well as performance of ML 59–61.
LightGBM is a gradient-boosted decision tree algorithm that has gained popularity for its high performance and efficiency 62. Several studies have demonstrated the favorable predictive value of LightGBM in the field of medicine 63–65. Another study found that LightGBM had the best predictive ability among other ML models including XGBoost, logistic regression, and naïve Bayes 66. MLP-based models are effective in capturing nonlinear relationships, making them ideal candidates for complex and multifactorial disease classification including in stroke 67,68 when compared to conventional statistical modeling. Moreover, both LightGBM and MLP have been used in many clinical studies 69–71, demonstrating their extensive applicability and promising predictive performance. In the present study, the ML models outperformed conventional severity-of-illness systems. Scoring systems use a limited number of variables, which might restrict their predictive power in individual patients 72. ML models are capable of utilizing high-dimensional data, and this could account for their superior performance to conventional scoring systems.
In this study, the LightGBM model showed the highest predictive performance with respect to NRI at 10% FPR. This result suggests that decision-tree-based models could be more beneficial than logistic regression models for predicting in-ICU mortality in pneumonia patients at a high cut-off point of 90% specificity. ML models have a strength in capturing the nonlinear relations between the features and the predicted outcomes. We found notable non-linear relationships between in-ICU mortality and several selected variables, including PaO2, WBC count, pH, initial pulse rate, lymphocyte, HCO3−, and PaCO2. This could be the reason for the lower performance of the logistic regression model, a generalized linear model, in predicting the in-ICU mortality of patients with severe pneumonia. Although cross-validation results were not specific to an FPR level, the MLP model had the largest difference in the AU-ROC value between internal validation and cross-validation. This indicates that, compared with the other models, the MLP model might be relatively more overfitted to the training set.
Contrary to the performance in different subgroups at the 10% FPR level, our results in the entire test set demonstrated no significant difference in AU-ROC between the ML models and the logistic regression model. In partial dependence plots of the variables that contributed the most to model predictions, linear relationships with in-ICU mortality were observed, especially in those of the PaO2/FiO2 ratio, urine output, and initial SBP. This might be the reason for the lack of statistical significance in these differences. The SHAP method was used to determine the important influencing factors of in-ICU mortality in the ML models, and the results were similar to previous studies: PaO2/FiO2 73,74, CRP levels 75, urine output 76, initial SBP 73, PaO2 73,77, and leukopenia 78,79 or leukocytosis 80,81.
In addition, using the SHAP model, we found that higher ALP levels and prolonged PT-INR were associated with a higher risk of in-ICU mortality. ALP can be elevated as an acute-phase reaction in acute infections 82. In community-acquired pneumonia, elevated ALP levels were not associated with mortality 83. However, in critically ill patients with septic acute kidney injury (AKI), elevated ALP levels are associated with mortality 84. In our study, septic shock was found in > 40% of the critically ill patients with pneumonia, and AKI was a common condition. It appears that ALP is related to the severity of impaired renal function through systemic inflammation caused by pneumonia rather than the severity of pneumonia itself. With respect to prolonged PT-INR, substantial coagulation abnormalities are commonly observed in patients with sepsis or pneumonia 85. The excessive production of thrombogenic tissue factors in sepsis pneumonia compared with low levels of tissue factors under normal conditions 86 leads to the development of systemic coagulopathy during the period of pneumonia 87.
Our study had some limitations. First, as the models were developed using data retrospectively collected in a single center and were not externally validated, the results had limited generalizability. Although our study provides valuable insights into the performance of the ML models, further studies are needed to assess the generalizability and real-time applicability of these ML models in predicting in-ICU mortality in patients with severe pneumonia. Furthermore, these studies should include robust external validation using independent datasets and evaluation of the model performance in prospective clinical practice.
Second, owing to the small sample size and the inclusion of patients admitted within a long study period of over 6 years, there is a possibility of heterogeneity with respect to patient characteristics, treatment measures, and potential biases. Thus, it might be challenging to clearly establish the usefulness of ML, which is a non-parametric algorithm. Furthermore, six subgrouping variables were adopted. A large number of stratification variables with a small sample size could lead to optimistic results in the internal validation of the models.

Conclusion
Compared to conventional severity-of-illness scoring systems, the ML models of LightGBM, MLP, and logistic regression have better predictive performance for in-ICU mortality in patients with severe pneumonia. Moreover, this study shows the potential advantages of selecting individual ML models for predicting in-ICU mortality in different subgroups of patients with severe pneumonia.

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.